Deadline: Mar 14th, 23:00
Academic Integrity
This project is individual - it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
In this project we work on a store sales dataset that reports the total daily sales for different product families sold across all the Favorita stores located in Ecuador from Oct 2016 to Aug 2017.
Please fill out the following:
Download your notebook: File -> Download .ipynb
Click on the Files icon on the far left menu of Colab
Select & upload your .ipynb file you just downloaded, and then obtain its path (right click) (you might need to hit the Refresh button before your file shows up)
Execute the following in a Colab cell:
%%shell
jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb
An HTML version of your notebook will appear in the files, so you can download it.
Submit both `HTML` and `IPYNB` files on Quercus for grading.
This first part of the project assignment is to be completed independently from Parts 2 - 5. In this part you will be completing some coding tasks and submitting your results on Github. To access this part of the assignment and upload your answers, you will need to use Github. Please complete the following step-by-step instructions:
Create a Github account and install git for Windows or Mac:
Create a personal access token using your Github account. Go to Settings >> Developer Settings >> Personal access tokens >> Tokens (classic) and generate a new token (also classic). When creating the token make sure to fill the Note section and select the repo scope (for repository access, like pushing) and workflow (required to modify workflow files). Make sure you copy the Personal Access Token as soon as it gets generated.
https://github.com/APS1070-UofT/w24-project-3-part-1-*********
This is your private repository for getting this part's questions and uploading your answers. Copy this link into the text box below to be graded for this part.
### Add the link here ###
# https://github.com/APS1070-UofT/w24-project-3-part-1-Pyoussefpour
Open Git Bash, the app you downloaded in step 0, and set your Email and username by:
git config --global user.email "<your-GitHub-email>"
git config --global user.name "<your-GitHub-username>"
Create a folder for the course on your computer and cd into it (cd means Change Directory). For example, on a Windows machine with a folder at "C:\aps1070":
cd c:\aps1070
Get your assignment by cloning the link you got in step 2:
git clone https://github.com/APS1070-UofT/w24-project-3-part-1-*********
You will be asked to enter your GitHub username and password. Enter your GitHub username into the Username field, and paste the personal access token you copied in step 1 into the Password field.
A new folder should be created in your directory similar to:
C:\aps1070\w24-project-3-part-1-********
This folder contains an .ipynb notebook, which you need to upload to Colab manually and answer its questions.
After you finish working on the notebook, download it from Colab and move it into the directory from step 7.
Replace the old notebook with the new one that contains your answers. Make sure your completed notebook has the same name as the original notebook you downloaded.
To submit your work, follow:
cd <your assignment folder>
git add W24_Project_3_Part_1_git.ipynb
git commit -m "Final Submission"
git push
If you have any problems pushing your work to GitHub, you can try one of the following commands:
git push --force
or
git push origin HEAD:main
Make sure your submission is ready for grading. Open the private repository link in your browser and make sure you can see your final submission with your latest changes there. Only you and the teaching team can open that link.
Write a function get_sorted_eigen(df_cov) that gets the covariance matrix of dataframe df (from step 1), and returns sorted eigenvalues and eigenvectors using np.linalg.eigh. [0.25]
Scree plot. [0.25]
import pandas as pd
data_raw = pd.read_csv(
    filepath_or_buffer='https://raw.githubusercontent.com/Sabaae/Dataset/main/TotalSalesbyFamily.csv',
    index_col=0
)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data_raw)
std_df = pd.DataFrame(standardized_data, columns=data_raw.columns)
std_df.index = data_raw.index
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# 1. Covariance matrix of the standardized data
cov_matrix = std_df.cov()
print("Shape of Covariance Matrix:", cov_matrix.shape)

# 2. Eigendecomposition sorted by descending eigenvalue
def get_sorted_eigen(cov_mat):
    eigenvalues, eigenvectors = np.linalg.eigh(cov_mat)
    dec_i = eigenvalues.argsort()[::-1]    # indices in descending order
    eigenvalues = eigenvalues[dec_i]
    eigenvectors = eigenvectors[:, dec_i]  # reorder columns to match
    return eigenvalues, eigenvectors
#3
eigenvalues, eigenvectors = get_sorted_eigen(cov_matrix)
explained_var_ratio = eigenvalues / eigenvalues.sum()
cumulative_explained_var = np.cumsum(explained_var_ratio)
plt.figure(figsize=(20, 6))
plt.plot(explained_var_ratio, label='Explained Variance Ratio')
plt.plot(cumulative_explained_var, label='Cumulative Explained Variance')
plt.title('Scree Plot')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.legend(loc='best')
plt.show()
#4
num_PC_needed = np.where(cumulative_explained_var >= 0.999)[0][0] + 1
print("Number of PC needed to cover 99.9% of the dataset:", num_PC_needed)
# 5. Plot the first 16 principal components (columns of the eigenvector matrix)
fig, axs = plt.subplots(16, 1, figsize=(20, 16 * 10))
x = list(std_df.columns)
axs = axs.flatten()
for i in range(16):
    axs[i].xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    axs[i].xaxis.set_major_locator(mdates.AutoDateLocator())
    axs[i].plot(x, eigenvectors[:, i])  # the i-th PC is the i-th *column*
    axs[i].set_title(f'PC{i+1}')
    axs[i].set_xlabel("Date")
    for label in axs[i].get_xticklabels():
        label.set_rotation(45)
plt.show()
Shape of Covariance Matrix: (304, 304)
Number of PC needed to cover 99.9% of the dataset: 5
Compare the first few PCs with the rest of them. Do you see any difference in their trend?
The first two PCs share a similar trend: relatively large values with a little noise at the beginning, then near-zero values with little to no noise, and a sudden drop at the end. The remaining PCs are significantly noisier than the first few. This indicates that the first two PCs capture the most significant variation in the data; as the PC index increases, each component's significance decreases and its plot becomes noisier. The scree plot confirms this: the explained variance ratio drops sharply after the first few PCs, so they carry the majority of the variance.
Create a function that:
Plots 4 figures:
The incremental reconstruction of the original (not standardized) time-series for the specified family in a single plot. [1.5]
You should at least show 5 curves in a figure for incremental reconstruction. For example, you can pick the following (or any other combination that you think is reasonable):
Hint: you need to compute the reconstruction for the standardized time-series first, and then scale it back to the original (non-standardized form) using the StandardScaler inverse_transform help...
The residual error (df - df_reconstructed). On the x-axis, you have dates, and on the y-axis, the residual error.
Test your function using POULTRY, GROCERY I, SCHOOL AND OFFICE SUPPLIES, CELEBRATION, LAWN AND GARDEN, and FROZEN FOODS as inputs. [0.5]
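The reconstruction hint can be sketched on a tiny synthetic frame (a minimal illustration on random data, not the project dataset; all variable names here are arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Minimal sketch: standardize, project onto the top-k eigenvectors,
# back-project, then undo the scaling with inverse_transform.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 4)), columns=list("abcd"))

scaler = StandardScaler()
Z = scaler.fit_transform(df)  # standardized data

# Eigendecomposition of the covariance matrix, sorted in descending order
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = vals.argsort()[::-1]
vecs = vecs[:, order]

k = 4                                     # number of PCs to keep (all 4 here)
W = vecs[:, :k]
Z_hat = Z @ W @ W.T                       # project and back-project
df_hat = scaler.inverse_transform(Z_hat)  # back to the original scale

# Keeping every PC reproduces the original data exactly
assert np.allclose(df_hat, df.values)
```

Keeping fewer PCs (`k < 4`) gives the incremental reconstructions asked for above, with the residual error growing as `k` shrinks.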
from sklearn.metrics import mean_squared_error
def plot_family_figures(original_df, family_name):
    # 1. Original (non-standardized) time-series
    y = list(original_df.loc[family_name])
    x = original_df.columns
    plt.figure(figsize=(16, 8))
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
    plt.plot(x, y)
    plt.xticks(rotation=45)
    plt.xlabel("Date")
    plt.title(f"Original time-series for {family_name}")
    plt.show()

    # 2. Incremental PCA reconstruction
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(original_df)
    std_df = pd.DataFrame(standardized_data, columns=original_df.columns)
    std_df.index = original_df.index
    cov_matrix = std_df.cov()
    _, eigenvectors = get_sorted_eigen(cov_matrix)
    plt.figure(figsize=(16, 8))
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
    inc_lst = [1, 2, 4, 5, 8, 16]
    recon_dfs = []
    for i in inc_lst:
        W = eigenvectors[:, 0:i]
        projX = np.dot(std_df, W)                             # project onto the first i PCs
        Recon = scaler.inverse_transform(np.dot(projX, W.T))  # back-project, then undo scaling
        Recon_df = pd.DataFrame(Recon, columns=original_df.columns, index=original_df.index)
        recon_dfs.append(Recon_df.loc[family_name])
        plt.plot(Recon_df.loc[family_name], label=f"PC 1 to PC{i}")
    plt.title(f"Reconstructed time-series for {family_name}")
    plt.legend(loc='best')
    plt.xticks(rotation=45)
    plt.xlabel("Date")
    plt.show()

    # 3. Residual error of the best reconstruction
    mse_values = [mean_squared_error(original_df.loc[family_name], recon_df) for recon_df in recon_dfs]
    Best_res_err_i = np.argmin(mse_values)
    res_error = original_df.loc[family_name] - recon_dfs[Best_res_err_i]
    plt.figure(figsize=(16, 8))
    plt.plot(x, res_error, label=f"PC 1 to PC{inc_lst[Best_res_err_i]}")
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
    plt.legend(loc='best')
    plt.xticks(rotation=45)
    plt.title(f"Best Residual Error for {family_name}")
    plt.ylabel("Residual Error")
    plt.xlabel("Date")
    plt.show()

    # 4. RMSE vs number of PCs kept
    rmse_errs = []
    for i in range(1, 11):
        W = eigenvectors[:, 0:i]
        projX = np.dot(std_df, W)
        Recon = scaler.inverse_transform(np.dot(projX, W.T))
        Recon_df = pd.DataFrame(Recon, columns=original_df.columns, index=original_df.index)
        # squared=False returns the RMSE (squared=True would give the MSE)
        rmse_errs.append(mean_squared_error(original_df.loc[family_name], Recon_df.loc[family_name], squared=False))
    plt.figure(figsize=(16, 8))
    plt.plot(range(1, 11), rmse_errs)
    plt.title(f"RMSE for {family_name}")
    plt.xlabel("Number of PCs")
    plt.ylabel("RMSE")
    plt.show()
plot_family_figures(data_raw, "POULTRY")
plot_family_figures(data_raw, "GROCERY I")
plot_family_figures(data_raw, "SCHOOL AND OFFICE SUPPLIES")
plot_family_figures(data_raw, "CELEBRATION")
plot_family_figures(data_raw, "LAWN AND GARDEN")
plot_family_figures(data_raw, "FROZEN FOODS")
Modify your code in part 3 to use SVD instead of PCA. [1]
Explain if standardization or covariance computation is required for this part. Repeat part 3 and compare your PCA and SVD results. Write a function to make this comparison [0.5], and comment on the results. [0.5].
from sklearn.metrics import mean_squared_error
def SVD_PCA_CMP(original_df, family_name):
    # 1. Original (non-standardized) time-series
    y = list(original_df.loc[family_name])
    x = original_df.columns
    plt.figure(figsize=(16, 8))
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
    plt.plot(x, y)
    plt.xticks(rotation=45)
    plt.xlabel("Date")
    plt.title(f"Original time-series for {family_name}")
    plt.show()

    # 2. Incremental reconstruction with both SVD and PCA
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(original_df)
    std_df = pd.DataFrame(standardized_data, columns=original_df.columns)
    std_df.index = original_df.index
    cov_matrix = std_df.cov()
    _, eigenvectors = get_sorted_eigen(cov_matrix)
    plt.figure(figsize=(16, 8))
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
    inc_lst = [1, 2, 4, 5, 8, 16]

    ## SVD (note: np.linalg.svd returns V already transposed) ##
    U, S, V = np.linalg.svd(original_df)
    recon_dfs_SVD = []
    recon_dfs_PCA = []
    for i in inc_lst:
        ## SVD: truncate to the first i singular triplets ##
        U_truncated = U[:, :i]
        S_truncated = np.diag(S[:i])
        V_truncated = V[:i, :]
        recon_data_SVD = U_truncated.dot(S_truncated).dot(V_truncated)
        Recon_df_SVD = pd.DataFrame(recon_data_SVD, columns=original_df.columns, index=original_df.index)
        recon_dfs_SVD.append(Recon_df_SVD.loc[family_name])
        plt.plot(Recon_df_SVD.loc[family_name], label=f"elements 1 to {i}")
        ## PCA ##
        W = eigenvectors[:, 0:i]
        projX = np.dot(std_df, W)
        recon_data_PCA = scaler.inverse_transform(np.dot(projX, W.T))
        Recon_df_PCA = pd.DataFrame(recon_data_PCA, columns=original_df.columns, index=original_df.index)
        recon_dfs_PCA.append(Recon_df_PCA.loc[family_name])
        plt.plot(Recon_df_PCA.loc[family_name], label=f"PC 1 to PC{i}")
    plt.title(f"Reconstructed time-series for {family_name} | SVD vs PCA")
    plt.legend(loc='best')
    plt.xticks(rotation=45)
    plt.xlabel("Date")
    plt.show()

    # 3. Residual error of the best reconstruction for each method
    ## SVD ##
    mse_values_SVD = [mean_squared_error(original_df.loc[family_name], recon_df) for recon_df in recon_dfs_SVD]
    Best_res_err_i_SVD = np.argmin(mse_values_SVD)
    res_error_SVD = original_df.loc[family_name] - recon_dfs_SVD[Best_res_err_i_SVD]
    ## PCA ##
    mse_values_PCA = [mean_squared_error(original_df.loc[family_name], recon_df) for recon_df in recon_dfs_PCA]
    Best_res_err_i_PCA = np.argmin(mse_values_PCA)
    res_error_PCA = original_df.loc[family_name] - recon_dfs_PCA[Best_res_err_i_PCA]
    plt.figure(figsize=(16, 8))
    plt.plot(x, res_error_SVD, label=f"elements 1 to {inc_lst[Best_res_err_i_SVD]}")
    plt.plot(x, res_error_PCA, label=f"PC 1 to PC{inc_lst[Best_res_err_i_PCA]}")
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.AutoDateLocator())
    plt.legend(loc='best')
    plt.xticks(rotation=45)
    plt.title(f"Best Residual Error for {family_name} | SVD vs PCA")
    plt.ylabel("Residual Error")
    plt.xlabel("Date")
    plt.show()

    # 4. RMSE vs number of components for each method
    rmse_errs_PCA = []
    rmse_errs_SVD = []
    for i in range(1, 11):
        ## SVD ##
        U_truncated = U[:, :i]
        S_truncated = np.diag(S[:i])
        V_truncated = V[:i, :]
        recon_data_SVD = U_truncated.dot(S_truncated).dot(V_truncated)
        Recon_df_SVD = pd.DataFrame(recon_data_SVD, columns=original_df.columns, index=original_df.index)
        # squared=False returns the RMSE (squared=True would give the MSE)
        rmse_errs_SVD.append(mean_squared_error(original_df.loc[family_name], Recon_df_SVD.loc[family_name], squared=False))
        ## PCA ##
        W = eigenvectors[:, 0:i]
        projX = np.dot(std_df, W)
        recon_data_PCA = scaler.inverse_transform(np.dot(projX, W.T))
        Recon_df_PCA = pd.DataFrame(recon_data_PCA, columns=original_df.columns, index=original_df.index)
        rmse_errs_PCA.append(mean_squared_error(original_df.loc[family_name], Recon_df_PCA.loc[family_name], squared=False))
    plt.figure(figsize=(16, 8))
    plt.plot(range(1, 11), rmse_errs_SVD, label="SVD")
    plt.plot(range(1, 11), rmse_errs_PCA, label="PCA")
    plt.title(f"RMSE for {family_name} | SVD vs PCA")
    plt.xlabel("Number of Elements")
    plt.ylabel("RMSE")
    plt.legend(loc='best')
    plt.show()
SVD_PCA_CMP(data_raw, "POULTRY")
SVD_PCA_CMP(data_raw, "GROCERY I")
SVD_PCA_CMP(data_raw, "SCHOOL AND OFFICE SUPPLIES")
SVD_PCA_CMP(data_raw, "CELEBRATION")
SVD_PCA_CMP(data_raw, "LAWN AND GARDEN")
SVD_PCA_CMP(data_raw, "FROZEN FOODS")
Explain if standardization or covariance computation is required for this part:
PCA requires standardization because it operates on the covariance matrix of the features; SVD, on the other hand, decomposes the data matrix directly, so neither the covariance matrix nor standardization is strictly necessary. As the code above shows, we achieved a very good reconstruction without standardizing or computing a covariance matrix. However, it is good practice to standardize the data for SVD as well, especially if the features are on different scales, as it ensures each feature contributes equally.
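The link between the two decompositions can be checked numerically: for *centered* data, the right singular vectors of the data matrix are the eigenvectors of its covariance matrix, and the squared singular values divided by n - 1 are the eigenvalues. A small sketch on random data (not the project dataset):

```python
import numpy as np

# For centered X (n samples x p features): X = U S Vt implies
# cov(X) = X.T @ X / (n - 1) = V diag(S**2 / (n - 1)) V.T
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)  # centering is what links SVD to PCA

U, S, Vt = np.linalg.svd(X, full_matrices=False)
eigvals_from_svd = S**2 / (X.shape[0] - 1)

vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
vals, vecs = vals[::-1], vecs[:, ::-1]  # eigh sorts ascending; flip to descending

assert np.allclose(eigvals_from_svd, vals)
# Eigenvectors are unique only up to sign, so compare absolute values
assert np.allclose(np.abs(Vt), np.abs(vecs.T))
```

Because the code in this part applies SVD to the raw, uncentered matrix, its components are not identical to the PCA ones, which is precisely why the two reconstructions differ for small numbers of components.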
Comment on the results:
In most cases (except the first and last), the RMSE graphs show that SVD produces a better reconstruction with fewer components. As the number of components increases (typically at around two components), the RMSE of PCA drops significantly and becomes similar to that of SVD.
Create another dataset similar to the one provided in your handout using the raw information on average daily sales for different cities of Ecuador from 2015 to 2017 here. [1]
You need to manipulate the data to organize it in the desired format (i.e., the same format that was in previous parts). Missing values were removed such that if there was a missing value for the average sales of a particular city at a given date, that date has been completely removed from the dataset, even if the data of that specific date existed for other cities.
You are free to use any tools you like, from Excel to Python! In the end, you should have a new CSV file similar to the previous dataset. How many features does the final dataset have? How many cities are there?
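One way to do this reshaping in pandas is sketched below. The column names ('date', 'city', 'sales') and the toy values are assumptions for illustration only; the actual raw file may use a different schema:

```python
import pandas as pd

# Toy long-format frame standing in for the raw average-sales data
raw = pd.DataFrame({
    "date":  ["2015-10-09", "2015-10-09", "2015-10-10", "2015-10-10", "2015-10-11"],
    "city":  ["Quito", "Manta", "Quito", "Manta", "Quito"],
    "sales": [100.0, 50.0, 110.0, 55.0, 120.0],
})

# Average daily sales per city, with cities as rows and dates as columns
wide = raw.pivot_table(index="city", columns="date", values="sales", aggfunc="mean")

# Drop any date (column) that is missing for at least one city
wide = wide.dropna(axis=1, how="any")

print(wide.shape)  # (2, 2): 2015-10-11 is dropped because Manta has no value
```

After this, `wide.to_csv(...)` produces a CSV in the same cities-by-dates format as the handout dataset.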
Upload your new dataset (in CSV format) to your colab notebook, repeat part 4 for this dataset [1], and comment on the results [0.5]. When analyzing the cities, you may use Manta, Cuenca, Puyo, Quito, and El Carmen.
The code below helps you to upload your new CSV file to your colab session.
# load train.csv to Google Colab
from google.colab import files
uploaded = files.upload()
Saving ModAverageSalesbyCity.csv to ModAverageSalesbyCity.csv
### YOUR CODE HERE ###
import io
file_name = list(uploaded.keys())[0]
df_5 = pd.read_csv(io.BytesIO(uploaded[file_name]), index_col=0)
num_cities, num_features = df_5.shape
print(f"There are {num_cities} Cities & {num_features} Features")
display(df_5)
There are 21 Cities & 88 Features
| city | 2015-10-09 | 2015-11-02 | 2015-11-03 | 2015-11-06 | 2015-11-10 | 2015-11-11 | 2015-11-12 | 2015-11-27 | 2015-11-30 | 2015-12-08 | ... | 2017-05-12 | 2017-05-24 | 2017-05-26 | 2017-06-23 | 2017-07-03 | 2017-07-24 | 2017-07-25 | 2017-08-10 | 2017-08-11 | 2017-08-15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ambato | 31542.014000 | 27822.312000 | 36191.305010 | 23859.468978 | 20300.056004 | 20316.693000 | 19744.893993 | 24165.201980 | 25661.903004 | 24251.895000 | ... | 26937.050979 | 27009.775990 | 36976.897000 | 26296.420007 | 59496.756040 | 28180.418004 | 26133.407000 | 24809.644004 | 28385.367004 | 24250.683004 |
| Babahoyo | 11600.677000 | 12710.313000 | 17396.292990 | 11369.925000 | 12398.458010 | 8697.368000 | 8583.699000 | 9766.343000 | 11494.425000 | 12852.041000 | ... | 11381.513000 | 8884.296000 | 13790.370000 | 11434.866990 | 28131.835996 | 12222.570020 | 15244.628000 | 10547.809990 | 14356.665980 | 15469.877000 |
| Cayambe | 21402.435000 | 22231.977070 | 26304.007960 | 15074.156000 | 16676.495000 | 13585.936000 | 13317.631000 | 14113.566000 | 16876.365960 | 18877.560002 | ... | 15144.514000 | 16231.836000 | 20402.221000 | 18423.704010 | 53829.170000 | 18769.708900 | 18591.710000 | 18647.946990 | 19098.661000 | 21727.240000 |
| Cuenca | 42283.271000 | 36432.235001 | 41133.077000 | 37735.753994 | 34168.795990 | 32623.197000 | 25804.171996 | 29902.148986 | 34378.746004 | 35993.034020 | ... | 40067.292000 | 37507.410001 | 45678.021970 | 38360.226000 | 96090.773980 | 44041.960010 | 43246.158000 | 35612.462005 | 31420.124004 | 35793.972976 |
| Daule | 13496.379000 | 13666.855980 | 20319.285000 | 9503.860999 | 10714.570000 | 12459.297000 | 10064.095996 | 11258.719000 | 14520.076100 | 10984.482000 | ... | 12904.447010 | 14684.347000 | 18390.463999 | 12653.391000 | 33377.898000 | 21019.854000 | 13775.681880 | 10164.649996 | 16296.346016 | 13377.979000 |
| El Carmen | 6885.082000 | 6738.319000 | 9493.984000 | 6310.517000 | 5816.941000 | 4827.264000 | 3995.906000 | 6310.106000 | 7506.232004 | 6859.560000 | ... | 10852.887001 | 7398.540000 | 8658.114000 | 8752.575998 | 16110.150000 | 10245.246998 | 8241.906000 | 9757.816000 | 8513.834000 | 12666.858000 |
| Esmeraldas | 11227.911000 | 12568.154000 | 14271.258000 | 9793.129996 | 10186.844000 | 8659.104000 | 8634.872000 | 9270.106995 | 11533.167000 | 11315.572980 | ... | 10905.173000 | 10541.091000 | 16890.726000 | 13147.452000 | 34130.026000 | 19890.556000 | 16784.556900 | 14396.110000 | 16143.667010 | 15368.844000 |
| Guaranda | 9233.783990 | 9192.703000 | 12809.372000 | 7548.881995 | 7402.327000 | 7839.970000 | 8125.996000 | 7243.661000 | 9275.816000 | 7530.169040 | ... | 9097.972000 | 6910.838000 | 9850.477000 | 7011.985000 | 20518.218000 | 9849.937002 | 7982.132996 | 7548.927990 | 9447.868000 | 9282.187000 |
| Guayaquil | 79799.711998 | 74698.369978 | 107665.957025 | 76025.786979 | 75140.080092 | 80955.240022 | 61977.716948 | 71552.167014 | 88884.357953 | 90434.590000 | ... | 91809.729878 | 87530.682217 | 106471.622015 | 88567.186004 | 205639.769996 | 100569.026013 | 87021.361990 | 70987.349006 | 96893.858000 | 101063.805991 |
| Ibarra | 7380.097000 | 6560.456005 | 9124.198030 | 7489.676000 | 7642.239000 | 6830.743000 | 6678.518000 | 10444.105000 | 9868.717000 | 7773.694992 | ... | 6911.589000 | 7675.742000 | 8081.333000 | 6968.030010 | 18649.980000 | 7406.027000 | 7945.895995 | 7352.875000 | 5872.674000 | 7946.431000 |
| Latacunga | 14473.807990 | 11004.838000 | 20144.746000 | 12757.639995 | 14107.845010 | 9666.992990 | 10711.469010 | 11130.673000 | 13812.845010 | 14130.493000 | ... | 12533.987004 | 13891.953000 | 13543.101000 | 12543.491000 | 37245.732000 | 14685.889010 | 13017.809000 | 11846.774000 | 12960.356000 | 13234.096010 |
| Libertad | 17118.781000 | 10984.564020 | 14289.210000 | 9070.942000 | 11890.724000 | 8207.354000 | 8360.762000 | 9005.213000 | 11005.663010 | 12089.882010 | ... | 8702.012990 | 13083.886000 | 13963.088000 | 8742.605998 | 25328.004000 | 11181.118000 | 11586.005900 | 10710.354995 | 14586.856000 | 14986.342000 |
| Loja | 15863.452000 | 9770.596005 | 13822.180000 | 10005.363000 | 8840.110000 | 11665.514001 | 7404.340000 | 13061.282000 | 11185.973000 | 11483.838006 | ... | 10911.185000 | 12157.868000 | 13980.427001 | 11242.266003 | 24522.986000 | 12039.725020 | 9160.254000 | 10358.543000 | 10323.491000 | 9966.252000 |
| Machala | 19460.581000 | 18978.453996 | 29159.842990 | 20339.577000 | 20409.789000 | 18596.536010 | 16000.971000 | 19343.204000 | 24757.024000 | 21549.816000 | ... | 25693.668010 | 22343.217000 | 30137.195996 | 26268.650998 | 60198.220000 | 26665.972990 | 25369.037990 | 24516.454000 | 30478.515000 | 28950.659000 |
| Manta | 10290.942000 | 9056.484000 | 8058.467000 | 6946.495000 | 5432.825000 | 7344.095000 | 4884.690000 | 6554.121000 | 8765.552000 | 6058.863000 | ... | 33424.317000 | 35780.588990 | 47215.103100 | 31911.633000 | 69281.066000 | 31945.805000 | 26050.238000 | 27672.859990 | 42776.113000 | 26808.235000 |
| Playas | 7737.399000 | 8275.017000 | 7979.094990 | 4813.717000 | 4952.786000 | 4443.073000 | 3934.689000 | 5013.709005 | 5932.059000 | 5614.184000 | ... | 3849.725996 | 5848.015000 | 7626.652004 | 4724.148000 | 12565.460036 | 5523.291000 | 4227.066004 | 4600.743000 | 7139.339000 | 5371.156000 |
| Puyo | 21571.010010 | 5450.621000 | 8568.696005 | 4442.606004 | 3882.937996 | 4456.688000 | 3385.408000 | 3992.916003 | 4997.911002 | 3569.368000 | ... | 5593.631004 | 5143.498000 | 7273.221000 | 5469.792000 | 16891.282000 | 7647.207000 | 7077.916000 | 7881.464000 | 6350.380000 | 6917.787990 |
| Quevedo | 8490.001000 | 9386.270000 | 14402.613000 | 8066.522000 | 9110.046000 | 7002.633010 | 6504.663000 | 5846.691010 | 9901.207000 | 9979.792000 | ... | 7255.177990 | 6752.691000 | 8433.698000 | 7016.712000 | 19809.048000 | 10190.915100 | 8806.089000 | 6734.846000 | 8448.788000 | 11649.571000 |
| Quito | 467292.160052 | 410410.464874 | 566485.366976 | 350835.341093 | 309120.463962 | 385121.404283 | 248381.404064 | 342110.558056 | 399780.582929 | 362504.444810 | ... | 381046.218916 | 367852.492930 | 488308.578022 | 384543.607040 | 916164.083582 | 380045.835033 | 324623.762958 | 297292.358004 | 387405.731008 | 341655.357968 |
| Riobamba | 6946.029000 | 6808.531000 | 9434.007010 | 6258.710025 | 6849.927002 | 5300.250000 | 6125.887000 | 6301.334000 | 7523.285020 | 7087.681000 | ... | 7466.047000 | 6819.189000 | 13312.558996 | 6350.511996 | 17432.356000 | 7635.976000 | 7321.911000 | 6387.479000 | 7463.022000 | 9342.732000 |
| Santo Domingo | 28414.440008 | 28111.701994 | 38414.277000 | 26887.433999 | 25294.906004 | 24334.814000 | 19681.009996 | 21329.636000 | 28064.798000 | 26550.968000 | ... | 24526.439016 | 23770.475000 | 31168.905010 | 23169.435000 | 70418.442002 | 29747.352996 | 24940.334996 | 22778.011000 | 29211.225000 | 30309.081000 |
21 rows × 88 columns
SVD_PCA_CMP(df_5, "Manta")
SVD_PCA_CMP(df_5, "Cuenca")
SVD_PCA_CMP(df_5, "Puyo")
SVD_PCA_CMP(df_5, "Quito")
SVD_PCA_CMP(df_5, "El Carmen")
Comment on the results:
The results observed here are slightly different from those for the previous dataset. To begin with, the RMSE values for this dataset are significantly lower than before. Secondly, SVD did not dominate this time: for Manta, Puyo, and El Carmen, SVD performed better, while for Cuenca and Quito, PCA performed better. However, given the low residual errors, both methods reconstructed the data very well. This could be attributed to this dataset's lower number of features or higher correlation between cities.
Understanding PCA and SVD:
https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/
PCA:
Sales Data: